Abstract

This correspondence studies an estimator of the conditional support 
of a distribution underlying a set of i.i.d. observations. The relation 
with mutual information is shown via an extension of Fano's theorem 
in combination with a generalization bound based on a compression 
argument. Extensions to estimating the conditional quantile interval, 
and statistical guarantees on the minimal convex hull are given. 

Keywords: Statistical Learning, Fano's inequality, Mutual Information,
Support Vector Machines

1 Introduction 

Given a set of paired observations $\mathcal{D}_n = \{(X_i, Y_i)\}_{i=1}^n \subset \mathbb{R}^d \times \mathbb{R}$
which are i.i.d. copies of a random vector $(X, Y)$ possessing a fixed but unknown joint
distribution $F_{XY}$, this letter concerns the question of which values the random
variable $Y$ can possibly or likely take given a covariate $X$. This investigation
of predictive tolerance intervals is motivated by the fact that one is often interested in
characteristics of the joint distribution other than the conditional expectation
(regression): e.g. in econometrics one is often more interested in the
volatility of a market than in its precise prediction. In environmental sciences
one is typically concerned with the extremal behavior (i.e. the min
or max value) of a magnitude, and its respective conditioning on related
environmental variables.

The main contribution of this letter is the extension of Fano's classical
inequality (see e.g. [1], p. 38), which gives a lower bound on the mutual
information of two random variables. This classical result is extended towards
a setting of learning theory where random variables have an arbitrary
fixed distribution. The derivation yields a non-parametric estimator of the
mutual information possessing a probabilistic guarantee which is derived
using a classical compression argument. The described relationship differs
from other results relating estimators and mutual information, e.g. those using
Fisher's information matrix [1] or based on Gaussian assumptions as
in [2], in that a distribution-free context is adopted. As an aside, (i) an estimator
of the conditional support is derived and is extended to the setting
of conditional quantiles, (ii) its theoretical properties are derived, (iii) the
relation to the method of the minimal convex hull is made explicit, and (iv)
it is shown how the estimate can be computed efficiently by solving a linear
program.

While studied in the literature e.g. on quantile regression [3], we argue 
that this question can be approached naturally from a setting of statistical 
learning theory, pattern recognition and Support Vector Machines (SVM), 
see [4, 5] for an overview. A main conceptual difference with the existing 
literature on classical regression and other predictor methods is that no at- 
tempt is made whatsoever to reveal an underlying conditional mean (as in 
regression), conditional quantile (as in quantile regression), or minimal risk 
point prediction of the dependent variable (as in pattern recognition). Here 
we target instead (the change of) the rough contour of the conditional
distribution. This implies that one becomes interested in (i) to what extent
the estimated conditional support is conservative (i.e. does it
overestimate the actual conditional support?), and (ii) the probability
of covering the actual conditional support (i.e. with what probability a
new sample can occur outside the estimated interval).

Section 2 proves the main result and explores the relation with the convex
hull. From a practical perspective, Section 3 provides further insight
into how the optimal estimate can be found efficiently by solving a linear
program.

2 Support and Quantile Tubes 
2.1 Support Tubes and Risk 

Definition 1 (Support and Quantile Tubes) Let $\mathcal{D}_n$ be a set of data
sampled i.i.d. from a fixed but unknown joint distribution $F_{XY}$.
Let $\mathcal{H}_1 \subset \{m : \mathbb{R}^d \to \mathbb{R}\}$ and
$\mathcal{H}_2 \subset \{s : \mathbb{R}^d \to \mathbb{R}^+\}$ be proper function spaces,
where the latter is restricted to positive functions and $\mathcal{H}_2 \subset \mathcal{H}_1$.
Let $\wp(\mathbb{R})$ be the powerset of $\mathbb{R}$ such that
$\wp(\mathbb{R}) = \{V \subset \mathbb{R}\}$. The class of tubes
$\Gamma(\mathcal{H}_1, \mathcal{H}_2)$ is defined as

$$\Gamma(\mathcal{H}_1, \mathcal{H}_2) = \left\{ T_{m,s} : \mathbb{R}^d \to \wp(\mathbb{R}),\ m \in \mathcal{H}_1,\ s \in \mathcal{H}_2 \ \middle|\ T_{m,s}(x) = [m(x) - s(x),\ m(x) + s(x)] \right\}, \qquad (1)$$

abbreviated as $T_{m,s} = m \pm s$. A tube $T_{m,s} \in \Gamma(\mathcal{H}_1, \mathcal{H}_2)$ is a true support tube
(ST) of a joint distribution $F_{XY}$ if the equality $P(Y \in T_{m,s}(X)) = 1$ holds.
Similarly a tube $T_{m,s} \in \Gamma(\mathcal{H}_1, \mathcal{H}_2)$ is a true quantile tube (QT) for $F_{XY}$ of
level $0 < \alpha < 1$ if $P(Y \in T_{m,s}(X)) \ge 1 - \alpha$.

Figure 1: Example of a support vector tube based on a finite sample of a bivariate
random variable $(X, Y)$. A tube $T_{m,s}$ is defined as the conditional interval
$T_{m,s}(x) = [m(x) - s(x),\ m(x) + s(x)]$ with width $2s(x)$.

Let the indicator $I(Y \notin T_{m,s}(X))$ be equal to one if $Y \notin T_{m,s}(X)$ and zero
otherwise. We define the risk of a candidate ST for a given joint distribution
as follows

$$\mathcal{R}(T_{m,s}; F_{XY}) = E\left[ I(Y \notin T_{m,s}(X)) \right] = P(Y \notin T_{m,s}(X)), \qquad (2)$$

where the expectation is taken over the random variables $X$ and $Y$ with
joint distribution $F_{XY}$. Its empirical counterpart becomes
$\mathcal{R}_n(T_{m,s}; \mathcal{D}_n) = \frac{1}{n} \sum_{i=1}^n I(Y_i \notin T_{m,s}(X_i))$.
The study of support tubes based on empirical samples will yield bounds of the form

$$P\left( \sup_{T_{m,s}} \mathcal{R}(T_{m,s}; F_{XY}) \ge \epsilon \right) \le \eta(\epsilon; \Gamma(\mathcal{H}_1, \mathcal{H}_2)), \qquad (3)$$

where $0 < 1 - \epsilon < 1$ is the probability of covering the tube and where the
function $\eta(\cdot; \Gamma(\mathcal{H}_1, \mathcal{H}_2)) : [0, 1] \to [0, 1]$ expresses the confidence level in the
probability of covering.

2.2 Generalization Bound 

For now, we focus on the case of the ST; extensions specific to the QT are
described in Section 2.4. Assume a given hypothesis class $\Gamma(\mathcal{H}_1, \mathcal{H}_2)$
of STs. Consider an algorithm constructing a ST, say $T_{m,s}$, with zero
empirical risk $\mathcal{R}_n(T_{m,s}; \mathcal{D}_n) = 0$. The generalization performance can be
bounded using a geometrical argument which was also used for deriving the
compression bound outlined in [6], [7], and refined in various publications,
e.g. [8].

Theorem 1 (Compression Bound on Risk of a ST) Let $\mathcal{D}_n$ be sampled i.i.d.
from a fixed but unknown joint distribution $F_{XY}$. Consider the
class of tubes $\Gamma$ where each tube $T_{m,s}$ is uniquely determined by $D$ appropriate
samples (i.e., $T_{m,s}$ can be 'compressed' to $D$ samples). Let $n_D = n - D$
denote the number of remaining samples. Then, with probability exceeding
$1 - \delta < 1$, the following inequality holds for any $T_{m,s}$ where $\mathcal{R}_n(T_{m,s}; \mathcal{D}_n) = 0$:

$$\sup_{\mathcal{R}_n(T_{m,s}; \mathcal{D}_n) = 0} \mathcal{R}(T_{m,s}; F_{XY}) \le \frac{\log(K_{n,D}(\Gamma)) - \log(\delta)}{n - D} = \epsilon(\delta, D, n), \qquad (4)$$

where we define $K_{n,D}(\Gamma)$ as

$$K_{n,D}(\Gamma) = \binom{n}{D} \left( 2^{D-1} - 1 \right) \le \left( \frac{2ne}{D} \right)^D. \qquad (5)$$

Proof: At first, fix a ST determined by $D$ samples, say the first $D$
samples $\{(X_1, Y_1), \dots, (X_D, Y_D)\}$, denoted as $T^D_{m,s}$. Assume $F_{XY}$ is such
that the actual risk of this tube is larger than a given value $0 < \epsilon < 1$, i.e.
$\mathcal{R}(T^D_{m,s}; F_{XY}) \ge \epsilon$. Then the chance that the remaining $n - D$ i.i.d.
samples $\{(X_{D+1}, Y_{D+1}), \dots, (X_n, Y_n)\}$ are by chance consistent with $T^D_{m,s}$ is
lower than $\prod_{i=D+1}^n P(Y_i \in T^D_{m,s}(X_i)) \le (1 - \epsilon)^{n-D}$. This can be bounded
as follows

$$P\left( \mathcal{R}(T^D_{m,s}; F_{XY}) \ge \epsilon \right) \le (1 - \epsilon)^{n-D} \le e^{-\epsilon (n-D)}, \qquad (6)$$

making use of the classical binomial bound, see e.g. [5]. The finite number of
tubes which can be compressed without loss of information to $D$ points can
be bounded using a geometrical argument. Given $D$ points, every point can
be used to interpolate either the upper-function $m + s$ or the lower-function
$m - s$. However, switching the assignments of all points simultaneously leads
to the same ST, and the case of all points assigned to the same (upper- or
lower-) function does not result in a unique tube either. Therefore, the
number of STs which can be determined using $D$ samples out of $n$, denoted
as $K_{n,D}(\Gamma)$, can be bounded as follows:

$$K_{n,D}(\Gamma) = \binom{n}{D} \left( 2^{D-1} - 1 \right) \le \left( \frac{2ne}{D} \right)^D, \qquad (7)$$

where the inequality $\binom{n}{D} \le \left( \frac{ne}{D} \right)^D$ of the binomial coefficient is used,
together with $2^{D-1} - 1 \le 2^D$. Combining (6) and (7), and inverting the statement
as is classical, proves the result.

□ 
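The counting argument of the proof can be checked by brute force for small $D$. The sketch below enumerates all upper/lower assignments of $D$ points, identifies an assignment with its global flip, and discards the assignments that put every point on the same boundary. The helper names are hypothetical, and the constant in `tube_count_bound` assumes the bound $(2ne/D)^D$.

```python
from itertools import product
from math import comb, e


def count_assignments(D: int) -> int:
    """Count distinct upper/lower labelings of D points, as in the proof."""
    seen = set()
    for labels in product((+1, -1), repeat=D):
        if len(set(labels)) == 1:
            continue  # all points on one boundary: excluded in the proof
        flipped = tuple(-v for v in labels)
        seen.add(min(labels, flipped))  # a labeling and its global flip coincide
    return len(seen)


def tube_count(n: int, D: int) -> int:
    """Number of tubes determined by D out of n samples."""
    return comb(n, D) * count_assignments(D)


def tube_count_bound(n: int, D: int) -> float:
    """Upper bound (2ne/D)^D on the number of tubes (assumed constant)."""
    return (2.0 * n * e / D) ** D


print(count_assignments(3))                       # 3, i.e. 2**(3-1) - 1
print(tube_count(200, 3))                         # 3940200
print(tube_count(200, 3) <= tube_count_bound(200, 3))  # True
```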

A crucial element for this result is that it is known a priori that such
a tube with zero empirical risk exists independently from the data at hand
(realizable case); this assumption is fulfilled by construction. Although
combinatorial in nature (any found hypothesis $T_{m,s}$ should be determined entirely
by a subset of $D$ chosen examples), it is shown in the next section how this
property holds for a simple estimator which can be computed efficiently as
a standard linear program.

Example 1 (Tolerance level) The following example indicates the practical
use of this result: consider $n = 200$ i.i.d. samples with a corresponding
class of hypotheses each determined by three samples ($D = 3$ and thus
$K_{n,D}(\Gamma) < 3 \times 10^8$). Fixing the confidence level as $1 - \delta = 95\%$, one can
state that the true risk will not be higher than 0.1049. This result can
be used in practice as follows. Given an observed set of i.i.d. samples
$\mathcal{D}_n = \{(X_i, Y_i)\}_{i=1}^n \subset \mathbb{R} \times \mathbb{R}$, compute the tube $T_{m,s} = wx \pm t$ with $t > 0$,
$w \in \mathbb{R}$ and $\mathcal{R}_n(T_{m,s}; \mathcal{D}_n) = 0$. When a new sample $X_j \in \mathbb{R}$ arrives,
predict that the corresponding $Y_j \in \mathbb{R}$ will lie in the interval $wX_j \pm t$. Then
we are reasonably sure (with a probability of 0.95) that this assertion will
hold in at least 89.51% of the cases when the number $n_v$ of samples
$\{X_j\}_{j=1}^{n_v}$ goes to infinity.
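The figure quoted in the example can be reproduced numerically. The sketch below evaluates the compression bound with the number of tubes replaced by the upper bound $(2ne/D)^D$; that this particular upper bound is the one behind the quoted value is an assumption, supported by the fact that it returns 0.1049 for $n = 200$, $D = 3$ and a confidence of 95%. The function name is hypothetical.

```python
from math import e, log


def compression_epsilon(delta: float, D: int, n: int) -> float:
    """Risk level (log K - log delta) / (n - D) with K replaced by (2ne/D)^D."""
    log_K_bound = D * log(2.0 * n * e / D)
    return (log_K_bound - log(delta)) / (n - D)


eps = compression_epsilon(delta=0.05, D=3, n=200)
print(round(eps, 4))            # 0.1049
print(round(100 * (1 - eps), 2))  # 89.51, the guaranteed coverage in percent
```

The guaranteed risk shrinks roughly like $\log(n)/n$, so ten times more samples give a considerably tighter tube guarantee at the same confidence.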

A similar result can be obtained using the classical theory of non-parametric 
tolerance intervals, as initiated in [9], see e.g. [10]. 
Corollary 1 (Bound by Order Statistics) Let $\mathcal{D}_n$ be i.i.d. samples from
a fixed but unknown joint distribution $F_{XY}$. Consider the class of tubes
$\Gamma$ where each tube $T_{m,s}$ is uniquely determined by $D$ appropriate samples.
Then, with probability higher than $1 - \delta < 1$, the following inequality holds
for any $T_{m,s}$ where $\mathcal{R}_n(T_{m,s}; \mathcal{D}_n) = 0$:

$$P\left( \sup_{\mathcal{R}_n(T_{m,s}; \mathcal{D}_n) = 0} \mathcal{R}(T_{m,s}; F_{XY}) \ge \epsilon \right) \le K_{n,D}(\Gamma) \left( n(1 - \epsilon)^{n-1} - (n - 1)(1 - \epsilon)^n \right), \qquad (8)$$

where $K_{n,D}(\Gamma)$ is defined as in Theorem 1.

Proof: Consider at first a fixed tube $T^D_{m,s}$. After projecting all samples
$\{(X_i, Y_i)\}_{i=1}^n$ to the univariate sample $R_i = m(X_i) - Y_i$, it is clear that a
minimal tube with fixed $m$ will have borders $\min(R_i)$ and $\max(R_i)$. Note
that now $P(R \notin [\min(R_i), \max(R_i)])$ equals $\mathcal{R}(T^D_{m,s}; F_{XY})$. Application of
the standard results as in [9] for such tolerance intervals gives

$$P\left( P(R \notin [\min(R_i), \max(R_i)]) \ge \epsilon \right) \le n(1 - \epsilon)^{n-1} - (n - 1)(1 - \epsilon)^n. \qquad (9)$$

Application of the union bound over all hypotheses in $\Gamma$, as in the proof of
Theorem 1, gives the result.

□ 
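The tolerance-interval probability used in the proof can be checked by simulation. The sketch below compares the closed-form expression $n(1-\epsilon)^{n-1} - (n-1)(1-\epsilon)^n$ with a Monte Carlo estimate of the probability that the mass outside $[\min R_i, \max R_i]$ is at least $\epsilon$. Drawing the residuals from a uniform distribution on $[0, 1]$ is an assumption made for convenience; the function names, sample size and seed are likewise illustrative.

```python
import random


def order_statistic_tail(n: int, eps: float) -> float:
    """n(1-eps)^(n-1) - (n-1)(1-eps)^n, the right-hand side of the bound."""
    return n * (1.0 - eps) ** (n - 1) - (n - 1) * (1.0 - eps) ** n


def simulate_tail(n: int, eps: float, trials: int, seed: int = 0) -> float:
    """Monte Carlo estimate of P(mass outside [min, max] >= eps), R ~ U(0, 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.random() for _ in range(n)]
        # For U(0, 1) the mass outside the sample range is 1 - (max - min).
        mass_outside = 1.0 - (max(sample) - min(sample))
        hits += mass_outside >= eps
    return hits / trials


exact = order_statistic_tail(n=20, eps=0.1)
estimate = simulate_tail(n=20, eps=0.1, trials=20000)
print(round(exact, 4))  # 0.3917
print(abs(estimate - exact) < 0.02)  # True
```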

Remark that this bound is qualitatively very similar to the previous one. As
a most interesting aside, the previous result implies a generalization bound
on the minimal convex hull, i.e. a bound on the probability mass contained
in the minimal Convex Hull (CH) of an i.i.d. sample. We consider the planar
case; the extension to the higher dimensional case follows straightforwardly.
Formally, one may define the minimal planar convex hull $\mathrm{CH}(\mathcal{D}_n)$ of a sample
$\mathcal{D}_n = \{(X_i, Y_i)\}_{i=1}^n$ as the minimal subset of $\mathbb{R} \times \mathbb{R}$ containing all samples
$(X_i, Y_i) \in \mathbb{R} \times \mathbb{R}$, and all convex combinations of any set of samples.

Theorem 2 (Probability Mass of the Planar Convex Hull) Let $\mathcal{D}_n$ contain
i.i.d. samples of a random variable $(X, Y) \in \mathbb{R} \times \mathbb{R}$. Then with probability
exceeding $1 - \delta < 1$, the probability mass outside the minimal convex
hull $\mathrm{CH}(\mathcal{D}_n)$ is bounded as follows

Pi(X.Y) ( CHW.)) < SKEW- 1.5122 -kgffl 

n — 6 
Proof: The key element of the proof is found in the fact that the CH is the
intersection of all linear support tubes in $\Gamma$ with minimal (constant) width
having zero empirical risk. Let $\#\mathrm{CH}(\mathcal{D}_n)$ denote this intersection, formally,

$$(x, y) \in \#\mathrm{CH}(\mathcal{D}_n) \Leftrightarrow y \in T_{m,s}(x), \quad \forall T_{m,s} : \mathcal{R}_n(T_{m,s}; \mathcal{D}_n) = 0. \qquad (11)$$

Now we prove that $\#\mathrm{CH}(\mathcal{D}_n) = \mathrm{CH}(\mathcal{D}_n)$. Assume at first that
$\#\mathrm{CH}(\mathcal{D}_n) \subset \mathrm{CH}(\mathcal{D}_n)$, then a point $(X, Y) \in \mathrm{CH}(\mathcal{D}_n)$ exists where
$(X, Y) \notin \#\mathrm{CH}(\mathcal{D}_n)$, but this is in contradiction to the assertion that
$\mathrm{CH}(\mathcal{D}_n)$ should be minimal: indeed also $\#\mathrm{CH}(\mathcal{D}_n)$ is convex (an
intersection of convex sets), and contains all samples by construction.

Conversely, assume that $\mathrm{CH}(\mathcal{D}_n) \subset \#\mathrm{CH}(\mathcal{D}_n)$, then a point
$(X, Y) \in \#\mathrm{CH}(\mathcal{D}_n)$ exists where $(X, Y) \notin \mathrm{CH}(\mathcal{D}_n)$, and the point $(X, Y)$ is included
in all tubes $T_{m,s}$ having $\mathcal{R}_n(T_{m,s}; \mathcal{D}_n) = 0$. By definition of the convex
hull $(X, Y) \notin \mathcal{D}_n$, neither can it be a convex combination of any set of
samples. Now, by the supporting hyperplane theorem (see e.g. [11]), there
exists a linear hyperplane separating this point from the minimal convex
hull. Constructing a tube $T_{m,s}$ where $m + s$ equals this supporting plane,
and with width large enough such that $\mathcal{R}_n(T_{m,s}; \mathcal{D}_n) = 0$, contradicts the
assumption, proving the result.

Now, note that by definition the following relation holds

$$P\left( (X, Y) \notin \#\mathrm{CH}(\mathcal{D}_n) \right) = \sup_{T_{m,s} :\, \mathcal{R}_n(T_{m,s}; \mathcal{D}_n) = 0} \mathcal{R}(T_{m,s}; F_{XY}). \qquad (12)$$

Moreover, the set of linear tubes in $\mathbb{R}^2$ with fixed width can be characterized
by a set containing exactly $D = 3$ samples as proven in the following section.
Finally, specializing the result of Theorem 1 in (4) gives the result.

□ 
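The quantity bounded by the theorem, the probability mass outside the minimal convex hull of a planar sample, can be estimated empirically. The sketch below builds the hull with Andrew's monotone chain algorithm and estimates the outside mass with fresh draws. It illustrates the quantity only and does not compute the bound itself; the standard Gaussian data, the function names and the sample sizes are assumptions.

```python
import random
from typing import List, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of (a - o) x (b - o); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def inside_hull(hull: List[Point], q: Point) -> bool:
    """True if q lies in the closed convex polygon given counter-clockwise."""
    k = len(hull)
    return all(cross(hull[i], hull[(i + 1) % k], q) >= 0 for i in range(k))


def mass_outside_hull(n: int, n_test: int, seed: int = 0) -> float:
    """Fraction of fresh Gaussian draws falling outside the hull of n draws."""
    rng = random.Random(seed)
    draw = lambda: (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
    hull = convex_hull([draw() for _ in range(n)])
    return sum(not inside_hull(hull, draw()) for _ in range(n_test)) / n_test


square = convex_hull([(0, 0), (1, 0), (1, 1), (0, 1), (0.5, 0.5)])
print(square)  # [(0, 0), (1, 0), (1, 1), (0, 1)]
```

Calling `mass_outside_hull(20, 2000)` and `mass_outside_hull(200, 2000)` shows the outside mass shrinking as the sample grows, which is the behavior the theorem quantifies.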

Note that classically the expected probability mass of a CH is expressed
in terms of the expected number of extremal points of the data cloud [12].
Interestingly, the literature on statistical learning studies the number of
extreme points in estimators as an (empirical) measure of complexity of
a hypothesis space; note e.g. the correspondence between Theorem 12
in [4] and Theorem 2 in [12], and the coding interpretation of SVMs, see
e.g. [4,7,8]. A disadvantage of the mentioned approach appears to be that the
expected number of extremal points of the convex hull is a quantity which is
difficult to characterize a priori (without seeing the data) without presuming
restrictions on the underlying distribution [5]. The key observation of the
previous theorem is that this number can be bounded by decomposing the
minimal convex hull as the intersection of a set of linear tubes.
2.3 Support Tubes and Mutual Information

At first, a technical Lemma is proven which will play a major role in the 
main result of the paper stated below. 

Lemma 1 (Upper-bound to the Conditional Entropy) Let $T_{m,s} : \mathbb{R}^d \to V \subset \mathbb{R}$
be a fixed tube, then one has

$$H(Y \mid (X, Y) \in T_{m,s}(X)) \le E[\log(2s(X))]. \qquad (13)$$

Proof: The proof follows from the following inequality: for a fixed $x \in \mathbb{R}^d$
it holds that

$$H(Y \mid Y \in T_{m,s}(x)) \le \log(2s(x)), \qquad (14)$$

following the fact that the uniform distribution has maximal entropy over all
distributions on a fixed interval. The conditional entropy then satisfies

$$H(Y \mid (X, Y) \in T_{m,s}(X)) = \int H(Y \mid X = x, Y \in T_{m,s}(x)) \, dF_X(x) \le \int \log(2s(x)) \, dF_X(x),$$

hereby proving the result.



□ 
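The maximum-entropy fact behind the lemma can be checked numerically for one interval. The sketch below integrates $-p \log p$ with a midpoint rule for a uniform and a triangular density on $[-s, s]$; the uniform density attains $\log(2s)$ while the triangular one stays strictly below it. The half-width $s = 1.5$, the choice of a triangular density and the function name are assumptions for this example.

```python
from math import log


def differential_entropy(pdf, a: float, b: float, steps: int = 20000) -> float:
    """Midpoint-rule estimate of -integral of p log p over [a, b] (in nats)."""
    h = (b - a) / steps
    total = 0.0
    for k in range(steps):
        p = pdf(a + (k + 0.5) * h)
        if p > 0.0:
            total -= p * log(p) * h
    return total


s = 1.5  # hypothetical half-width, so the interval is [-s, s]
uniform = lambda y: 1.0 / (2.0 * s)
triangular = lambda y: (s - abs(y)) / (s * s)

h_uniform = differential_entropy(uniform, -s, s)
h_triangular = differential_entropy(triangular, -s, s)

print(round(h_uniform, 4), round(log(2.0 * s), 4))  # 1.0986 1.0986
print(h_triangular < h_uniform)                     # True
```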



In the case $\mathcal{H}_2 = \{s = t,\ t \in \mathbb{R}^+_0\}$, one has $H(Y \mid (X, Y) \in T_{m,s}(X)) \le \log(2t)$.
The motivation for the analysis of the support tube is found in the following
lower-bound to the mutual information based on a finite sample.

Theorem 3 (Lower-bound to the Mutual Information) Given a hypothesis
class of tubes $\Gamma(\mathcal{H}_1, \mathcal{H}_2)$ and a set of i.i.d. samples $\mathcal{D}_n$. Let
$\epsilon(\delta, D, n)$ be as in equation (4) for a confidence exceeding $1 - \delta < 1$, and assume
that the corresponding risk level satisfies $\epsilon(\delta, D, n) < 0.5$. The
following lower bound on the expected mutual information $I(Y|X)$ holds with
probability exceeding $1 - \delta$:

$$H(Y|X) \le h(\epsilon(\delta, D, n)) + \epsilon(\delta, D, n) H(Y) + (1 - \epsilon(\delta, D, n)) E[\log(2s(X))] \qquad (15)$$

and equivalently

$$I(Y|X) \ge (1 - \epsilon(\delta, D, n)) \left( H(Y) - E[\log(2s(X))] \right) - h(\epsilon(\delta, D, n)), \qquad (16)$$

where the expectation is taken with respect to the marginal distribution $F_X$ of $X$,
and $h(\cdot)$ is the entropy of a Bernoulli random variable with parameter $\epsilon$.



Proof: The proof of this inequality roughly follows the derivation of Fano's
inequality as in e.g. [1]. Let the random variable $U = g(X, Y, T_{m,s}) \in \{0, 1\}$
be defined as $U = I(Y \notin T_{m,s}(X))$ with $n$ i.i.d. samples
$\{U_i = I(Y_i \notin T_{m,s}(X_i))\}_{i=1}^n$.
Applying the chain rule for the conditional entropy twice gives

$$H(U, Y|X) = H(Y|X) + H(U|X, Y) = H(Y|X) \qquad (17)$$
$$H(Y, U|X) = H(U|X) + H(Y|U, X) \le H(U) + H(Y|U, X), \qquad (18)$$

since $U$ is a function of $X$ and $Y$, so that the conditional entropy $H(U|X, Y) = 0$,
and $H(U|X) \le H(U)$. Theorem 1 states that for $T_{m,s}$ with zero empirical
risk, the actual risk satisfies $E[U] = \mathcal{R}(T_{m,s}; F_{XY}) \le \epsilon(\delta, D, n)$ with probability
higher than $1 - \delta$, such that the quantity $H(U)$ can be bounded with
the same probability as

$$H(U) \le -\epsilon \log(\epsilon) - (1 - \epsilon) \log(1 - \epsilon) \triangleq h(\epsilon), \qquad (19)$$

because the entropy of a Bernoulli variable is concave with maximum at 0.5,
hence increasing on $[0, 0.5]$, and $0 < \epsilon(\delta, D, n) < 0.5$ by assumption, see e.g. [1].

Now, the second term of the rhs of (18) is considered. Note first that
since $H(Y) \ge H(Y|X, U = 0)$, it holds for all $0 < a \le \epsilon(\delta, D, n) < 0.5$ that

$$a H(Y) + (1 - a) H(Y|X, U = 0) \le \epsilon(\delta, D, n) H(Y) + (1 - \epsilon(\delta, D, n)) H(Y|X, U = 0). \qquad (20)$$

Hence,

$$\begin{aligned}
H(Y|U, X) &= P(U = 1) H(Y|X, U = 1) + P(U = 0) H(Y|X, U = 0) \\
&\le P(U = 1) H(Y) + P(U = 0) H(Y|X, U = 0) \qquad (21) \\
&\le \epsilon(\delta, D, n) H(Y) + (1 - \epsilon(\delta, D, n)) H(Y|X, U = 0) \\
&\le \epsilon(\delta, D, n) H(Y) + (1 - \epsilon(\delta, D, n)) E[\log(2s(X))], \qquad (22)
\end{aligned}$$

where the first inequality follows from $H(Y|X, U = 1) \le H(Y)$, and the
second one from (20) and since $P(U = 1) \le \epsilon(\delta, D, n)$. The third inequality
constitutes the core of the proof, following from the previous Lemma. Combining
this inequality with (19) and the definition of mutual information,
$I(Y|X) = H(Y) - H(Y|X)$, yields inequality (16).

□ 
In the case of the class of tubes with constant nonzero width $2t \in \mathbb{R}^+_0$, the
inequality can be written as follows. With probability higher than $1 - \delta < 1$,
the following lower-bound holds

$$I(Y|X) \ge (1 - \epsilon(\delta, D, n)) \left( H(Y) - \log(2t) \right) - h(\epsilon(\delta, D, n)), \qquad (23)$$

if $\epsilon(\delta, D, n) < 0.5$. Maximizing this lower-bound can be done by minimizing
the width $t$ and maximizing the probability of covering $(1 - \epsilon)$, since the
unconditional entropy is fixed.
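The trade-off in the constant-width lower bound can be made concrete with a few lines of code. The sketch below evaluates $(1 - \epsilon)(H(Y) - \log(2t)) - h(\epsilon)$ in nats; the entropy $H(Y) = 2$, the half-width $t$ and the risk level $\epsilon$ are hypothetical inputs, and the function names are illustrative.

```python
from math import log


def binary_entropy(eps: float) -> float:
    """Entropy (nats) of a Bernoulli variable with parameter eps in (0, 1)."""
    return -eps * log(eps) - (1.0 - eps) * log(1.0 - eps)


def mi_lower_bound(entropy_y: float, t: float, eps: float) -> float:
    """(1 - eps) * (H(Y) - log(2t)) - h(eps), stated for 0 < eps < 0.5."""
    if not 0.0 < eps < 0.5:
        raise ValueError("the bound is stated for 0 < eps < 0.5")
    return (1.0 - eps) * (entropy_y - log(2.0 * t)) - binary_entropy(eps)


# Hypothetical inputs: H(Y) = 2 nats, half-width t = 0.5, eps = 0.1.
bound = mi_lower_bound(entropy_y=2.0, t=0.5, eps=0.1)
print(round(bound, 4))  # 1.4749
```

Halving $t$ at the same $\epsilon$ raises the bound by $(1 - \epsilon)\log 2$, which is why the narrowest tube with zero empirical risk is the most informative one.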

From Definition 1, it follows that a ST is not uniquely defined for a fixed
$F_{XY}$. From the above derivation, a natural choice is to look for the most
informative (and hence the least conservative) support tube as follows

$$T^*_{m,s} = \arg\min_{T_{m,s} \in \Gamma(\mathcal{H}_1, \mathcal{H}_2)} \|s\| \quad \text{s.t. } T_{m,s} \text{ is a ST to } F_{XY}, \qquad (24)$$

where $\|\cdot\|$ denotes a (pseudo-)norm on the hypothesis space $\mathcal{H}_2$, proportional
to the term $E[\log 2s(X)]$ of equation (16). Let the theoretical risk of a ST on
$F_{XY}$ be defined as $\mathcal{R}(T_{m,s}, F_{XY}) = \int P(Y \notin T_{m,s}(x) \mid X = x) \, dF_X(x)$. Given
only a finite number of observations in $\mathcal{D}_n$, the empirical counterpart is
studied

$$\hat{T}_{m,s} = \arg\min_{T_{m,s} \in \Gamma(\mathcal{H}_1, \mathcal{H}_2)} \|s\|_{\mathcal{H}_2} \quad \text{s.t. } \mathcal{R}_n(T_{m,s}; \mathcal{D}_n) = 0. \qquad (25)$$

2.4 Quantile Tubes 

The discussion can be extended to the case of quantile tubes of a level
$0 < \alpha < 1$. Assume we have an estimator which for a sample $\mathcal{D}_n$ returns a
tube $T_{m,s}$ specified by exactly $D$ samples such that at most $\lceil \alpha n \rceil$ samples
violate the tube. The question of how well this estimator behaves for novel
samples is considered. Specifically, we bound the expected occurrence of a
sample not contained in the tube $T_{m,s}$ as follows, using Hoeffding's inequality
in the classical way.

Proposition 1 (Deviation Inequality for Quantile Tubes) When $\mathcal{D}_n$
contains $n$ i.i.d. samples, and any hypothesis $T_{m,s}$ can be represented by
exactly $D$ samples, one has with probability exceeding $1 - \delta < 1$ that

$$\mathcal{R}(T_{m,s}; F_{XY}) - \alpha \le \mathcal{R}_n(T_{m,s}; \mathcal{D}_n) + 2 \sqrt{\frac{2D \log\left(\frac{2ne}{D}\right) - 2 \log\left(\frac{\delta}{4}\right)}{n}}. \qquad (26)$$

The proof follows straightforwardly from the Vapnik and Chervonenkis inequality
with $K_{n,D}(\Gamma) \le \left(\frac{2ne}{D}\right)^D$ different hypotheses, see e.g. [4] or [5]. It is
a straightforward exercise to use this result to derive a bound on the mutual
information in the case of quantile tubes as previously.
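The concentration underlying such deviation inequalities can be illustrated for a single, fixed tube. The sketch below uses the two-sided Hoeffding inequality $P(|\mathcal{R}_n - \mathcal{R}| > \epsilon) \le 2\exp(-2n\epsilon^2)$ for one hypothesis chosen before seeing the data; it deliberately omits the union over all tubes, so its radius is not the constant of the proposition. The violation probability, sample sizes, seed and function names are assumptions.

```python
import random
from math import log, sqrt


def hoeffding_radius(n: int, delta: float) -> float:
    """eps solving 2 * exp(-2 * n * eps**2) = delta (one fixed tube only)."""
    return sqrt(log(2.0 / delta) / (2.0 * n))


def exceedance_rate(alpha: float, n: int, delta: float, trials: int,
                    seed: int = 0) -> float:
    """How often |empirical miss rate - alpha| exceeds the Hoeffding radius."""
    rng = random.Random(seed)
    radius = hoeffding_radius(n, delta)
    bad = 0
    for _ in range(trials):
        # Each new sample violates the fixed tube with probability alpha.
        misses = sum(rng.random() < alpha for _ in range(n))
        bad += abs(misses / n - alpha) > radius
    return bad / trials


radius = hoeffding_radius(n=1000, delta=0.05)
rate = exceedance_rate(alpha=0.1, n=1000, delta=0.05, trials=2000)
print(round(radius, 4))  # 0.0429
print(rate <= 0.05)      # True
```

Handling a tube selected from the data requires the additional factor for the number of candidate hypotheses, which is what enlarges the square-root term in the proposition.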

3 Linear Support/Quantile Vector Tubes 

Given the specified methodology, this section elaborates on a practical estimator
and shows how to extend results to quantile tubes. Here we restrict
ourselves to the linear model class $\mathcal{H}_1 = \{m : m(x) = x^T w \mid w \in \mathbb{R}^d\}$ and
the class of parallel tubes $\mathcal{H}_2 = \{s : s(x) = t,\ t \in \mathbb{R}^+\}$ with constant width
for clarity of explanation. Problem (25) with $\Gamma(\mathbb{R}^d, \mathbb{R}^+)$ can be cast as a
linear programming problem as follows,

$$(\hat{w}, \hat{t}) = \arg\min_{w,\, t \ge 0} t \quad \text{s.t. } -t \le Y_i - w^T X_i \le t, \quad \forall i = 1, \dots, n. \qquad (27)$$
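For a scalar covariate and no intercept, as in Example 1, the minimal tube can be found with the standard library alone. The sketch below does not call an LP solver: it exploits the fact that $\max_i |Y_i - wX_i|$ is convex in $w$ and applies a ternary search instead, which reaches the same minimizer in this one-dimensional case. The function name and the toy data are assumptions.

```python
from typing import List, Tuple


def minimal_support_tube(xs: List[float], ys: List[float],
                         iters: int = 200) -> Tuple[float, float]:
    """Minimise t = max_i |y_i - w * x_i| over scalar w by ternary search."""
    slopes = [y / x for x, y in zip(xs, ys) if x != 0.0]
    if not slopes:                      # all covariates zero: w is irrelevant
        return 0.0, max(abs(y) for y in ys)
    width = lambda w: max(abs(y - w * x) for x, y in zip(xs, ys))
    lo, hi = min(slopes), max(slopes)   # a minimiser lies between extreme slopes
    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if width(a) <= width(b):
            hi = b                      # a minimiser remains in [lo, b]
        else:
            lo = a                      # a minimiser remains in [a, hi]
    w = 0.5 * (lo + hi)
    return w, width(w)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.5, 3.5, 6.5, 7.5]               # hypothetical data: y = 2x +/- 0.5
w_hat, t_hat = minimal_support_tube(xs, ys)
print(round(w_hat, 6), round(t_hat, 6))  # 2.0 0.5
```

In higher dimensions the search over $w$ is no longer one-dimensional, and a general-purpose LP solver applied to the linear program is the natural choice.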

The more general case of QT requires an additional step:

Lemma 2 (Quantile Vector Tubes) The following estimator (strictly)
excludes at most $C$ observations (quantile property), while the functions
$w^T x - t$ and $w^T x + t$ interpolate at least $d + 1$ sample points (interpolation
property). If the underlying distribution $F_{XY}$ is Lebesgue smooth and
non-degenerate (hence no linear dependence between the variables and the vector
of ones occurs), exactly $d + 1$ points are interpolated with probability 1.

$$(\hat{T}_{w,t}, \hat{\xi}_i) = \arg\min_{w,\, t,\, \xi} J_C(t, \xi_i) = Ct + \sum_{i=1}^n \xi_i$$
$$\text{s.t. } -t - \xi_i \le w^T X_i - Y_i \le t + \xi_i, \quad \xi_i \ge 0, \quad \forall i = 1, \dots, n. \qquad (28)$$

Moreover, the observations which satisfy the inequality constraints with equality
determine the solution completely (representer property), hereby justifying
the name of Support/Quantile Vector Tubes in analogy with the nomenclature
in support vector machines.

Proof: The quantile property is proven as follows. Let $\alpha_i^+, \alpha_i^-, \beta_i \in \mathbb{R}^+$
be positive Lagrange multipliers $\forall i = 1, \dots, n$. The Lagrangian of the
constrained problem (28) becomes

$$\mathcal{L}_C(w, t, \alpha^+, \alpha^-, \beta) = J_C(t, \xi_i) - \sum_{i=1}^n \beta_i \xi_i - \sum_{i=1}^n \alpha_i^+ \left( w^T X_i - Y_i + t + \xi_i \right) - \sum_{i=1}^n \alpha_i^- \left( Y_i - w^T X_i + t + \xi_i \right).$$

The first order conditions for optimality become

$$\begin{cases}
\dfrac{\partial \mathcal{L}_C}{\partial t} = 0 \ \to\ C = \sum_{i=1}^n (\alpha_i^+ + \alpha_i^-) & (a) \\[1ex]
\dfrac{\partial \mathcal{L}_C}{\partial w} = 0 \ \to\ 0_d = \sum_{i=1}^n (\alpha_i^+ - \alpha_i^-) X_i & (b) \\[1ex]
\dfrac{\partial \mathcal{L}_C}{\partial \xi_i} = 0 \ \to\ 1 = (\alpha_i^+ + \alpha_i^-) + \beta_i. & (c)
\end{cases} \qquad (29)$$

Following the complementary slackness conditions ($\beta_i \xi_i = 0$, $\forall i = 1, \dots, n$),
it follows that $\beta_i = 0$ for data-points outside the tube ($\xi_i > 0$). By condition (29.c)
each such point has $\alpha_i^+ + \alpha_i^- = 1$, so that by condition (29.a) at most $C$ points
can lie outside the tube. This proves the quantile property.

The interpolation property follows from the fundamental lemma of a
linear programming problem: the solution to the problem satisfies at least
$d + 1 + n$ inequality constraints with equality. If $t \ne 0$, then at least $d + 1$
constraints $\xi_i = 0$ should be satisfied as at most $n$ constraints of the $2n$
inequalities of the form $-t - \xi_i \le (w^T X_i - Y_i)$ and $(w^T X_i - Y_i) \le t + \xi_i$
can hold at the same time. If $t = 0$, the problem reduces to the classical
least absolute deviation estimator, possessing the above property. Let
$\mathbf{x} = (X_1, \dots, X_n)^T \in \mathbb{R}^{n \times d}$ be a matrix and $\mathbf{y} = (Y_1, \dots, Y_n)^T \in \mathbb{R}^n$ be a vector.
If the matrix $(1_n, \mathbf{x}, \mathbf{y}) \in \mathbb{R}^{n \times (1 + d + 1)}$ is nonsingular ($F_{XY}$ is non-degenerate)
the solution to the problem (28) satisfies exactly $n + d + 1$ inequalities, and
any two functions $\{w^T x - t,\ w^T x + t\}$ can at most (geometrically) interpolate
$d + 1$ linearly independent points.

Since a solution interpolates $d + 1$ (linearly independent) points exactly
under the above conditions, knowledge of which points, say $S \subset \{1, \dots, n\}$,
implies the optimal solution $\hat{w}$ and $\hat{t}$ as

$$w^T X_i \pm t = Y_i, \quad \forall i \in S, \qquad (30)$$

where $\pm t$ denotes whether the specific sample interpolates the upper- or
lower function. This means that the solution can be represented as the set
$S$ together with a one-bit flag indicating the sign. To represent the solution,
one as such needs $(d + 1)(\ln(n) + 1)$ bits. The probability mass inside the
tube is given by the value $C$ which is known a priori.

□ 
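The quantile property can be seen directly when the center $w^T x$ is held fixed, an assumption made here purely for illustration. The objective $Ct + \sum_i \max(0, |r_i| - t)$ is then convex and piecewise linear in $t$, with slope $C$ minus the number of absolute residuals exceeding $t$, so a minimizer is an order statistic of the absolute residuals. The residual values and function names below are hypothetical.

```python
from math import floor
from typing import List


def objective(t: float, abs_residuals: List[float], C: float) -> float:
    """J_C for fixed w: C*t plus the slacks max(0, |r_i| - t)."""
    return C * t + sum(max(0.0, a - t) for a in abs_residuals)


def optimal_half_width(abs_residuals: List[float], C: float) -> float:
    """A minimiser of J_C over t >= 0 when the centre w^T x is held fixed."""
    ordered = sorted(abs_residuals, reverse=True)
    k = floor(C)                      # at most k residuals may exceed t
    return ordered[k] if k < len(ordered) else 0.0


abs_res = [0.1, 2.0, 0.7, 1.3, 0.4, 3.1]   # hypothetical |w^T X_i - Y_i|
C = 2.0
t_star = optimal_half_width(abs_res, C)
excluded = sum(a > t_star for a in abs_res)

print(t_star)    # 1.3
print(excluded)  # 2, which does not exceed C
```

Taking $C$ at least as large as the sample size drives the half-width to zero, so that only the sum of absolute residuals remains in the objective.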

Note that a similar principle lies at the heart of the derivation of the $\nu$-SVM
[13]. The representer property is unlike the classical representer theorems
for kernel machines, as no regularization term (e.g. $\|w\|$) occurs in the
estimator. In the case of $C \to 0$, the estimator (28) results in the smallest
support tube. When $C \to +\infty$, the robust $L_1$ norm is obtained [14], and
when $C$ is such that $t = \epsilon$, the $\epsilon$-loss of the SVR is implemented. One has to
keep in mind however that despite those computational analogies, the scope
of interval estimation differs substantially from that of the $L_1$ and the SVR
point predictors.

We now turn to the computationally more challenging task of estimating
multiple conditional quantile intervals at the same time.

Proposition 2 (Multi-Quantile Vector Tubes) Consider the set of tubes 
defined as 



q-{m) 
m,s 



T 



T 
W X 



E 4 

fc=i 



k > 



T 
W X 



k=l 



(31) 



1=1 



where $m(x) = w^T x$. The parameters $w \in \mathbb{R}^d$, $t^+ = (t_0^+, \dots, t_m^+)^T \in \mathbb{R}^{m+1}$
and $t^- = (t_0^-, \dots, t_m^-)^T \in \mathbb{R}^{m+1}$ can be found by solving the following linear
programming (LP) problem



m m n 

min j c (t+, r, C) = E + *D + E E^i + + t) 

w,t '* i=i l=i i=i 

\-^ l -t^<(w T X i -Y i )<t+ + ^ 1 
s.t. < 

Vl = l,...,m, Vi = l,...,n. (32) 

Then every solution excludes at most $C_l$ datapoints (generalized quantile
property), while the boundaries of all tubes pass through at most $d + 2(m + 1)$
datapoints.

Proof: The proof follows exactly the same lines as in Lemma 2, employing
the fundamental theorem of linear programming and the first order
conditions of optimality. Note that by construction, the different quantiles
are properly nested, i.e. not allowed to cross.

□ 

Figure 2 gives an example of such a multi-quantile tube with a nonlinear
function $m$ which is a linear combination of localized basis-functions. This
computational mechanism of inferring and representing the empirically optimal
tube $T_{m,s}$ can be extended to data represented in a more complex metric
(e.g. $X \subset \mathbb{R}^d$ where $d \to \infty$, or by using reproducing kernels). Hereto, it is
easily seen that one needs another mechanism of restricting the hypothesis
space $\mathcal{H}_1$. Consider for example the class $\mathcal{H}_1^\rho = \{m(x) = w^T x \mid \|w\| \le \rho\}$,
having a finite covering number (see e.g. [4]). The disadvantage in this case
is on the one hand that one should choose the regularization constant
in an appropriate way a priori. On the other hand, the influence of
the regularization term becomes nontrivial in both the theoretical as well as
in the computational derivation.

Figure 2: Example of a Multi-Quantile Vector Tube for n = 250 samples with
a = (25, 12, 6, 3, 2, 1). Here $m$ consists of a linear combination of 10 localized
basis-functions.

4 Conclusion 

This paper studied an intuitive estimator of the conditional support and
quantiles of a distribution. The result is shown to be useful to estimate the
mutual information of the sample by extending the reach of Fano's theorem
in combination with standard results of learning theory. It is indicated how
the theoretical results relate to estimating the minimal convex hull.